The White Wine dataset is a public available dataset. The dataset is related to white variaants of Portuguese “Vinho Verde” wine.
## [1] 4898 12
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The dataset contains 4898 white wine instances. There are 12 variables. 11 of them are input variables based on physicochemical tests and their data type are numeric.
Input variables (based on physicochemical tests):
Fixed Acidity: Amount of tartaric acid in the wine
Volatile Acidity: Amount of acetic acid in wine
Citric Acid: Amount of citric acid in the wine, it is usually used to add ‘freshness’ and flavor to wines
Residual Sugar: Amount of sugar remaining after fermentation stops
Chlorides: Amount of salt in the wine
Free Sulfur Dioxide: Amount of free form of sulfur dioxide gas, it can prevent microbial growth and the oxidation of wine
Total Sulfur Dioxide: Amount of free and bound forms of sulfur dioxide gas
Density: Density of water
pH: Acidicity or basicity of the wine
Sulphates: Amount of sulphates in wine, it is a wine additive used as an antimicrobial and antioxidant
Alcohol: Percent of alcohol content of the wine
Output variable (based on sensory data):
Quality: Qualiy of the wine
Looking at the histograms, I find that some variables, such as residual sugar, chlorides and alcohol, are right-skewed. I also find that some variables contain outliners. In order to take a closer look, I created some boxplots.
According to the boxplot, it is clear that a number of variables contain outliners. And among these variables, variables such as volatile acidity, chlorides and sulphates contains relatively more outliers than the others.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
##
## 3 4 5 6 7 8
## 0.004083299 0.033278889 0.297468354 0.448754594 0.179665169 0.035728869
## 9
## 0.001020825
From the histogram and the table, I observe that the quality of wines scored from 3 to 9, while the mean is 5.88 and the median is 6.00. I also find that most of the wine are scored 6(44.88%), and only a very small amount of wine are scored 9(0.1%).
I created a new quality variable by converting the integer variable to factor variable. Since the quality of wines scored from 3 to 9, I created 7 levels and named them based on scores, such as “Score 3”, “Score 5”, “Score7”.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
According to the histogram and the table, I find that the percent of alcohol ranged from 8.00%/volume to 14.20%/volume, the mean is 10.51%/volume and the median is 10.40%/volume.
Although 75% of wines have 11.40% of alcohol in wine and the wine that contains that highest percent of alcohol has 14.20% of alcohol, according to the boxplot there does not have any outliners.
The histogram is shifted to the left, which means density variable has at least one outliners. Let’s take a closer look by creating a boxplot.
The boxplot concurs my observaton. It showed that there are three outliers. Thus, I created another histogram that contains only the 1th to 99th percentile of the density level.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The histogram is a little right-skewed. The density of wines ranged from 0.98g/dm^3 to 1.03g/dm^3, the mean and median are 0.99g/dm^3.
Similar with the density of wines, the histogram of the amount of citirc acid in wine is shifted to the left. So I created a boxplot to check if there are any outliers.
The boxplot showed that there are a number of outliers in both tails. Thus, I created another histogram that contains only the 1th to 99th percentile of the amount of critic acid in wine to take a close look of this variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
According to the histogram and the table, I find that the mean of the amount of citric acid in wine is 0.33g/dm^3 and the median is 0.32g/dm^3.
Similar with the density of wine and the amount of citric acid in wine, the histogram of the amount of residual suger in wine is shifted to the left. So I guess it contains at least one outlier.
The boxplot confirmed my guess and there are a few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The above histogram contains only the 1th to 99th percentile of the amount of residual suger in wine and it is heavy right skewed, The histogram and the table show that the mean is 6.39g/dm^3 and the median is 5.20g/dm^3. It consist a very large outliners as the max is 65.8g/dm^3, while 75% of the wine contains only 9.90g/dm^3 of residual suger.
The data set consist 4898 white wine instances and consists 12 variables. The 11 input vairable are numeric variables, while the output variable(quality of wine) is intger variable.
The quality of wine is the main feature as it is an output variable. Moreover, the quality(score) of the wine plays an important role in influencing people’s decision in picking wines.
The density, ph value and the amount of citric acid and residual sugar are also interesting features. Becauses the thickness and flavor of the wine will also influence people’s preference. Moreover, citric acid is usually used to add “freshness” and flavor to wines, thus a right amount of critic acid could increase the quality of the wine.
I created a factor variable for the quality variable. I created 7 levels and named them based on the score. I think it will be a way
The dataset is very tidy and has no missing attribute value, but some of the variables are right-skeded and consist outliers.
I ploted some scatter plots to explore the input varibles’ relationship with quality of wine. The plot showed that the percent of alcohol in wine and pH value seems to have a positive relationship with the quality of wine, the amount of chlorides and the amount of volatile acidity seems to have a relationship with the quality of wine.
The correlation matrix confirmed my guesses that the percent of alcohol in wine and the pH value have a positive relationship with the quality of wine, however the relationship between the pH value and the quality of wine is not as strong as I thought. And althought the correlation matrix concurs with my guess that the amount of chlorides and volatile acidity have a negative relationship with the quality of wine, their relationship are weak. The matrix also showed that the amount of free silfur dioxide and critic acid do not have any relationship with the quality of wine.
Besides, the density of wine has a strong positive relationship with the amount of total sulfur dioxide and residual sugar in wine. The amount of total sulfur dioxide in wine also have a strong relationship with the amount of free sulfur dioxide in wine.
The scatter plot demonstrates that there is the percent of alcohol has a positive relationship with the quality of wine.
## data$quality_c: Score 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.34 11.00 12.60
## --------------------------------------------------------
## data$quality_c: Score 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## data$quality_c: Score 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## data$quality_c: Score 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## data$quality_c: Score 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## data$quality_c: Score 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## data$quality_c: Score 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
The boxplot and the table confirmed my observation, wines with a higher score tends to have a higher percent of alcohol. However, some wines with a score 5 have a very high percent of alcohol too.
##
## Pearson's product-moment correlation
##
## data: data$alcohol and data$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
The correlation test I computed consist with my observation. The result shows that there is a moderate positive correlation between the two variables (r = 0.44, p < 0.05).
The scatter plot shows that the density of wine has a negative relationship with the qaulity of wine.
## data$quality_c: Score 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
## --------------------------------------------------------
## data$quality_c: Score 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
## --------------------------------------------------------
## data$quality_c: Score 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
## --------------------------------------------------------
## data$quality_c: Score 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## data$quality_c: Score 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
## --------------------------------------------------------
## data$quality_c: Score 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
## --------------------------------------------------------
## data$quality_c: Score 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
The boxplot and the table confirmed my observation, wines with a higher score tends to have a lower density.
##
## Pearson's product-moment correlation
##
## data: data$density and data$quality
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
And the result of the pearsone correlation test shows that there is a weak negative correlation between these variables (r = -0.31, p < 0.05).
The scatter demonstrates no relationship bewteen the amout of critic acid and the quality of wine.
## data$quality_c: Score 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2100 0.2575 0.3450 0.3360 0.3850 0.4700
## --------------------------------------------------------
## data$quality_c: Score 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1900 0.2900 0.3042 0.4000 0.8800
## --------------------------------------------------------
## data$quality_c: Score 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2400 0.3200 0.3377 0.4100 1.0000
## --------------------------------------------------------
## data$quality_c: Score 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.270 0.320 0.338 0.380 1.660
## --------------------------------------------------------
## data$quality_c: Score 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2800 0.3100 0.3256 0.3600 0.7400
## --------------------------------------------------------
## data$quality_c: Score 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.2800 0.3200 0.3265 0.3600 0.7400
## --------------------------------------------------------
## data$quality_c: Score 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.290 0.340 0.360 0.386 0.450 0.490
The boxplot and the table confirmed my observation, the quality of wine does not seems to have any relationship with the amount of citric acid in wine.
##
## Pearson's product-moment correlation
##
## data: data$citric.acid and data$quality
## t = -0.6444, df = 4896, p-value = 0.5193
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03720595 0.01880221
## sample estimates:
## cor
## -0.009209091
The result of the pearson correlation test confirmed my guess. The result shows that there is no correlation between the two variables (r = -0.01, p < 0.05).
The scatter plot demonstratess that the amount of residual sugar has a weak relationship wih the quality of wine.
## data$quality_c: Score 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.588 4.600 6.392 10.700 16.200
## --------------------------------------------------------
## data$quality_c: Score 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## data$quality_c: Score 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## data$quality_c: Score 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## data$quality_c: Score 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## data$quality_c: Score 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## data$quality_c: Score 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
The boxplot and the table show that wine with a higher score tend to have a lower amount of residual sugar.
##
## Pearson's product-moment correlation
##
## data: data$residual.sugar and data$quality
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12524103 -0.06976101
## sample estimates:
## cor
## -0.09757683
I computed a pearson correlation test to access the relationship between these variable. The result demonstrates that there is a very weak negative correlation between thequality of wine and the amount of residual suger in wine(r = -0.10, p < 0.05).
Based on the scatter plots and boxplots, I observated the quality of wine has a positive moderate relationship with the percent of alcohol in wine, a weak negative relationship with the density of wine, a very weak negative relationship with the amount of residual sugar in wine, and no relationship with the amount of citric acid in wine. All of the observations have been confirmed by the result of pearson correlation tests.
The strongest relationship I found is the relationship between the quality of wine and the percent of alcohol in wine. The result of the pearson correlation test shows that there is a moderate positive correlation between the two variables (r = 0.44, p < 0.05). Thus, wines that content a higher percent of alcohol tend to have a better quality.
The scatterplot demonstrated there is a relationship between the density of wine, the percent of alcohol in wine and the quality of wine. Wines with a higher quality tend to have a higher percent of acohol in wine and density of wine ratio.
The scatterplot fails to show any relationship between the density, the pH value and the quality of wine.
Although the scatterplot shows wines with a score 9 tends to have a higher amount of citric acid in wine and amount of sugar in wine ratio, it fails to show any relationship between the amount of residual sugar, citric acid in wine and the quality of wine.
##
## Calls:
## m1: lm(formula = quality ~ fixed.acidity, data = data)
## m2: lm(formula = quality ~ fixed.acidity + volatile.acidity, data = data)
## m3: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid,
## data = data)
## m4: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar, data = data)
## m5: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides, data = data)
## m6: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide, data = data)
## m7: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide,
## data = data)
## m8: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density, data = data)
## m9: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH, data = data)
## m10: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates, data = data)
## m11: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = data)
##
## =========================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11
## ---------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 6.696*** 7.210*** 7.214*** 7.232*** 7.499*** 7.447*** 7.519*** 237.792*** 275.802*** 283.857*** 150.193***
## (0.103) (0.107) (0.108) (0.108) (0.107) (0.112) (0.111) (8.375) (8.643) (8.633) (18.804)
## fixed.acidity -0.119*** -0.124*** -0.122*** -0.117*** -0.121*** -0.118*** -0.100*** 0.043** 0.161*** 0.166*** 0.066**
## (0.015) (0.015) (0.015) (0.015) (0.015) (0.015) (0.015) (0.015) (0.017) (0.017) (0.021)
## volatile.acidity -1.735*** -1.741*** -1.689*** -1.547*** -1.523*** -1.308*** -1.737*** -1.720*** -1.699*** -1.863***
## (0.122) (0.124) (0.124) (0.122) (0.123) (0.123) (0.116) (0.113) (0.113) (0.114)
## citric.acid -0.037 0.011 0.187 0.175 0.216* 0.108 0.148 0.094 0.022
## (0.108) (0.108) (0.106) (0.107) (0.106) (0.098) (0.096) (0.096) (0.096)
## residual.sugar -0.013*** -0.011*** -0.012*** -0.005* 0.097*** 0.124*** 0.128*** 0.081***
## (0.002) (0.002) (0.003) (0.003) (0.004) (0.005) (0.005) (0.008)
## chlorides -7.797*** -7.871*** -7.029*** -2.118*** -0.477 -0.286 -0.247
## (0.559) (0.561) (0.561) (0.552) (0.554) (0.550) (0.547)
## free.sulfur.dioxide 0.001 0.007*** 0.003*** 0.003*** 0.003*** 0.004***
## (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
## total.sulfur.dioxide -0.004*** 0.000 0.000 -0.000 -0.000
## (0.000) (0.000) (0.000) (0.000) (0.000)
## density -233.806*** -277.081*** -285.389*** -150.284***
## (8.503) (8.884) (8.875) (19.075)
## pH 1.236*** 1.162*** 0.686***
## (0.088) (0.087) (0.105)
## sulphates 0.829*** 0.631***
## (0.098) (0.100)
## alcohol 0.193***
## (0.024)
## ---------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.0 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.3 0.3 0.3
## adj. R-squared 0.0 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.3 0.3 0.3
## sigma 0.9 0.9 0.9 0.9 0.8 0.8 0.8 0.8 0.8 0.8 0.8
## F 64.1 133.9 89.3 74.6 100.9 84.6 89.0 184.4 192.6 183.0 174.3
## p 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -6322.8 -6224.2 -6224.1 -6209.7 -6114.2 -6112.9 -6060.9 -5708.8 -5611.2 -5575.5 -5543.7
## Deviance 3791.4 3641.8 3641.7 3620.3 3481.9 3480.0 3406.9 2950.6 2835.4 2794.3 2758.3
## AIC 12651.5 12456.4 12458.2 12431.3 12242.5 12241.8 12139.8 11437.5 11244.4 11175.0 11113.5
## BIC 12671.0 12482.3 12490.7 12470.3 12287.9 12293.8 12198.3 11502.5 11315.9 11253.0 11197.9
## N 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898
## =========================================================================================================================================================
I used the multiple linear regression analysis to develop a model for predicting the quality of wine from the 11 input variables(i.e. density, amount of critical aicd, chlorides). It was found that density(beta = -150.28, p < .001), volatile acidity(beta = -1.86, p < .001), amount of alcohol(beta = 0.19, p < .001), pH level(beta = 0.69, p < .001), amount of sulphates(beta = 0.63, p < .001), amount of residual sugar(beta = 0.08, p < .001), fixed acidity (beta = 0.66, p < .01) and amout of free sulfur dioxide(beta = 0.004, p < .001) were significant predictors. The amount of citric acid, chlorides and total sulfur dioxide are not a significant predictor. The overal model fit was R^2 = 0.3.
The scatterplots I created shows that wines that the quality of wine has a postive relationship with the density of wine and a negaitve relatioinship with the percent of alcohol in wine. Wines with a higher quality tend to have a higher percent of acohol in wine and density of wine ratio. However, the scatters fails to show any relationships between the quality, density and pH level of wine, and between the quality of wine with the amount of residual sugar and citric acid in wine.
It is pretty suprise to find that the amount of residual sugar and citric acid do not influence the quality of wine. Because citric acid is usually used to add ‘freshness’ to wines. So it was interesting to find that they have no relationship with the quality of wine as both of them were related with the favor of wines.
I created a linear model to predict the quality of white wine from the 11 input variables. The result shows that density is the best predictor of the quality of wine, and volatile acidity, pH level, fixed acidity and amount of alcohol, sulphates, residual sugar and free sulfur dioxide are also predictors of the quality of wine.
This boxplot demostrated the relationship between the quality of wine and the percentage of alcohol in wine. Although there are a few outliners, such as a number of wines that are scored 5 have a relatively high percentage of alcohol in wine, it shows a pretty clear picture that the wines with a higher quality tends to have a higher percentage of alcohol. The plot also shows that wines that are scored 6 has a wider range of percent of alcohol in wine that wines that have a different score.
This scatter plot shows that relationship between the density and the quality of wine and the percent of alcohol in wine. It domostrates that the density of wine has a negative correlation with the percentage of alcohol in wine. It also shows wines that have a higher score (darker color) tend to have a lower density and percent of alcohol in wine ratio, and wine that have a lower score (lighter color) tend to have a higher density and percent of alcohol in wine ratio.
I choose the correlation matrix because it is able to provide a very clear picture about the relationships between variables. The plot highlighted the correlations that is bigger than +/-0.3. It shows that the quality of wine has a moderate positive relationship wtth alcohol and weak negative relationship with density. The matrix also demostrates that the percent of alcohol in wine is the variable that has the most strong and moderate correlation with others variables. It has a strong negative relationship with density and moderate relationship with the amount of total sulfur dioxide, chlorides and residual sugar in wine.
I became more proficient in exploring and visualizing data with R after working on this exercise. The exercise advance my skills in using histograms, boxplots and scatter plots to present the data and computing pearson correlation test and multiple linear regression analysis to find the relationship between variables. It also taught me how to use correlation matrix, which is a efficient way to explore the data as it is able to provide a very clear picture about the correlation between variables.